[BUG]: Delete collection resource leak (single-node Chroma) #3297
Conversation
Reviewer Checklist
Please leverage this checklist to ensure your code review is thorough before approving.
- Testing, Bugs, Errors, Logs, Documentation
- System Compatibility
- Quality
This stack of pull requests is managed by Graphite. Learn more about stacking.
2e113a0 to b53dadb (Compare)
rohitcpbot left a comment
Thanks for identifying the leak and raising the fix. I did not see this earlier, so I did not review it sooner; my miss. Reviewed it now.
chromadb/api/segment.py (Outdated)
```python
if existing:
    segments = self._sysdb.get_segments(collection=existing[0].id)
```
I feel we should try not to call sysdb for getting segments. It adds an extra call to the backend for distributed Chroma.
Looking at the current code, I see we are already calling sysdb.get_segments() from the manager, so you are simply moving that line here and not adding extra calls. But I feel we can do better.
Do you think we should just call delete_segment() from delete_collection()?
So we can add this snippet back:
```python
for s in self._manager.delete_segments(existing[0]["id"]):
    self._sysdb.delete_segment(s)
```
and do a no-op inside delete_segments() in db/impl/grpc/client.py.
Will that fix the leak?
Using this snippet:
```python
for s in self._manager.delete_segments(existing[0]["id"]):
    self._sysdb.delete_segment(s)
```
makes sense, however we revert back to a non-atomic deletion of sysdb resources. In the above snippet we'd delete the segments separately from deleting the collection, which I wanted to avoid on purpose; that is why I pulled the fetching of the segments up here, before they are atomically deleted as part of self._sysdb.delete_collection.
Why do you think that this would cause extra calls in the distributed backend?
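For readers following along, here is a minimal sketch of the ordering described above; the variable names and exact signatures are assumptions based on the snippets in this thread, not the actual diff:
```python
# 1. fetch the segment records up front, so the sysdb delete can remain one atomic call
segments = self._sysdb.get_segments(collection=existing[0].id)

# 2. the collection and its segment rows are removed together, atomically, in the sysdb
self._sysdb.delete_collection(existing[0].id, tenant=tenant, database=database)

# 3. local, non-transactional cleanup (e.g. HNSW persist directories) happens last,
#    driven by the records fetched in step 1 (hypothetical keyword argument)
self._manager.delete_segments(segments=segments)
```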
b53dadb to ba07228 (Compare)
```diff
 def delete_segments(self, collection_id: UUID) -> Sequence[UUID]:
-    segments = self._sysdb.get_segments(collection=collection_id)
-    return [s["id"] for s in segments]
+    return []  # noop
```
@HammadB, I talked with @rohitcpbot and he mentioned that this should be a noop. Is this fine, or should I revert back to the older version with the distributed sysdb query?
chromadb/api/segment.py (Outdated)
```python
if existing:
    self._manager.delete_segments(collection_id=existing[0].id)
```
@rohitcpbot, this is the actual change as we discussed; the rest is just black formatting changes.
Thanks @tazarov.
If possible, leave a note with the following comment or similar:
```python
"""
This call will delete segment-related data that is stored locally and cannot be part of the atomic SQL transaction.
It is a NoOp for the distributed sysdb implementation.
Omitting this call will lead to a leak of segment-related resources.
"""
```
Can you answer something for me: if the process crashes immediately after self._manager.delete_segments(collection_id=existing[0].id),
then the actual entries in SQL are not deleted, which means the collection is not deleted.
Now, if the user issues a Get or Query, will the local manager work correctly?
If the fix to make the local manager work in the above failure scenario is non-trivial, then we could leave a note here and take it up as a separate task. But it will be good to know the state of the DB with the above change.
The same scenario would have had to be thought through even with your earlier change of doing the local manager delete after the sysdb delete, where the SQL could have gone through but the local manager delete did not because of a crash, leading to a leak.
@rohitcpbot, the local manager has two segments for each collection:
- sqlite - its def delete(self) -> None: will actually delete the segment's data
- hnsw - its def delete(self) -> None: will delete the directory where the HNSW is stored, but will not delete the segment from the segments dir
(A hedged sketch of the hnsw behaviour follows right below.)
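A hedged sketch of that hnsw behaviour; the class name and constructor are hypothetical and only illustrate "delete the directory, leave the segment record alone":
```python
import shutil
from pathlib import Path


class LocalHnswSegmentSketch:
    """Hypothetical stand-in for the local HNSW segment, not the real chromadb class."""

    def __init__(self, persist_directory: str) -> None:
        self._persist_dir = Path(persist_directory)

    def delete(self) -> None:
        # Removes the on-disk HNSW index directory only; the segment record in the
        # sysdb is expected to be cleaned up separately by the caller.
        if self._persist_dir.exists():
            shutil.rmtree(self._persist_dir)
```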
So here is a diagram to explain the point of failure:
The main problem as I see it in the current impl (with possible solutions):
I think the only foolproof way to remove it all is possibly to wrap everything in a single transaction, all the way from the segment. Then, if the physical dir removal fails, we'll roll back the whole sqlite transaction.
As a side note, on Windows deleting the segment dir right after closing file handles frequently fails.
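A minimal sketch of that single-transaction idea using the standard sqlite3 module; the table names and the function itself are purely illustrative, not the actual implementation:
```python
import shutil
import sqlite3
from pathlib import Path


def delete_collection_atomically(
    conn: sqlite3.Connection, collection_id: str, persist_dir: Path
) -> None:
    # `with conn` opens a transaction that commits on success and rolls back on exception.
    with conn:
        conn.execute("DELETE FROM segments WHERE collection = ?", (collection_id,))
        conn.execute("DELETE FROM collections WHERE id = ?", (collection_id,))
        # If removing the physical directory fails (e.g. open handles on Windows),
        # the exception propagates and the SQL deletes above are rolled back.
        if persist_dir.exists():
            shutil.rmtree(persist_dir)
```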
Thanks @tazarov, I left a comment asking to add a code comment, and also a question. We should be good to merge after that.
9fd2b3c to 788a07f (Compare)
After some deliberation, I think a good course of action is to make it all atomic by having self._sysdb.delete_collection accept a callback:
```python
self._sysdb.delete_collection(
    existing[0].id,
    lambda collection_id: self._manager.delete_segments(collection_id=collection_id),
    tenant=tenant,
    database=database,
)
```
and invoke the callback inside delete_collection. Wdyt?
As discussed offline, here are two promising approaches:
Callback:
```python
self._sysdb.delete_collection(
    existing[0].id,
    lambda collection_id: self._manager.delete_segments(collection_id=collection_id),
    tenant=tenant,
    database=database,
)
```
Possible failure points:
Return segments:
```python
segments = self._sysdb.delete_collection(
    existing[0].id, tenant=tenant, database=database
)
self._manager.delete_segments(segments=segments)
```
Possible failure points:
Tip: Once you have read the above, let me know when you get the chance to discuss it in the office.
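For the return-segments variant, a hedged sketch of what returning the deleted segment ids from inside the sysdb transaction could look like; the table names and the self._tx() transaction helper are assumptions for illustration, not the real implementation:
```python
from typing import List
from uuid import UUID


def delete_collection(self, collection_id: UUID, tenant: str, database: str) -> List[UUID]:
    # self._tx() is a hypothetical transaction context manager yielding a DB cursor
    with self._tx() as cur:
        cur.execute("SELECT id FROM segments WHERE collection = ?", (str(collection_id),))
        segment_ids = [UUID(row[0]) for row in cur.fetchall()]
        cur.execute("DELETE FROM segments WHERE collection = ?", (str(collection_id),))
        cur.execute("DELETE FROM collections WHERE id = ?", (str(collection_id),))
    # the caller can now hand these ids to the local manager for file cleanup
    return segment_ids
```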
@tazarov - as discussed offline, let's just fix the leaks for now and then we'll see how the atomicity will work across the various db entries and files. We can open a new ticket for fixing the atomicity and assign it to you or me. Thanks!
788a07f to 6163095 (Compare)
@rohitcpbot, I think this is it. The simplest form of the fix, where we just swap the order: delete the segments first, then the sysdb entries.
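For reference, a short sketch of the resulting order in delete_collection; the argument names follow the snippets in this thread and the final PR description, but the exact signatures are assumptions:
```python
if existing:
    # fetch the segment records before anything is deleted
    segments = self._sysdb.get_segments(collection=existing[0].id)
    # local cleanup first: drop on-disk segment data (e.g. HNSW persist dirs)
    self._manager.delete_segments(segments=segments)
    # then remove the collection and its segment rows atomically in the sysdb
    self._sysdb.delete_collection(existing[0].id, tenant=tenant, database=database)
```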
# Conflicts: # chromadb/api/segment.py
c1d60e3 to 6fdf23d (Compare)
rohitcpbot left a comment
LGTM
Description of changes
Closes #3296
The delete collection logic changes slightly to accommodate the fix without breaking the transactional integrity of self._sysdb.delete_collection. The chromadb.segment.SegmentManager.delete_segments had to change to accept the list of segments to delete instead of collection_id, as sketched below.
Summarize the changes made by this PR.
- Improvements & Bug fixes
  - Fixes the resource leak when deleting a collection
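A hedged sketch of that signature change on the local segment manager; the Segment record and the instance cache are simplified stand-ins for illustration only:
```python
from typing import Dict, Protocol, Sequence, TypedDict
from uuid import UUID


class Deletable(Protocol):
    def delete(self) -> None: ...


class Segment(TypedDict):
    # minimal stand-in for the sysdb segment record; the real record carries more fields
    id: UUID


class LocalSegmentManagerSketch:
    def __init__(self) -> None:
        # hypothetical cache of live local segment implementations, keyed by segment id
        self._instances: Dict[UUID, Deletable] = {}

    def delete_segments(self, segments: Sequence[Segment]) -> Sequence[UUID]:
        deleted = []
        for segment in segments:
            instance = self._instances.pop(segment["id"], None)
            if instance is not None:
                instance.delete()  # drop local state (e.g. the HNSW persist directory)
            deleted.append(segment["id"])
        return deleted
```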
Test plan
How are these changes tested?
- [x] Tests pass locally with pytest for python, yarn test for js, cargo test for rust
Documentation Changes
N/A